3  Summary statistics: Theory and practice

Author

Vladimir Buskin

3.1 Measures of central tendency

3.1.1 The mean

  • A useful summary statistic is the arithmetic mean \(\bar{x}\) (cf. Heumann, Schomaker, and Shalabh 2022: 38). Consider a variable \(X\) with observations \(x_1, x_2, ..., x_n\) from a sample of size \(n\). The sample mean then corresponds to

    \[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^n{x_i}. \]

In R, we can obtain the average value of a numeric vector with the mean() function.

# Using mean()
mean(cl.order$LEN_MC)
[1] 9.265509
# or by hand (using a new name so base R's mean() is not masked):
mean_by_hand <- 1/length(cl.order$LEN_MC) * sum(cl.order$LEN_MC)

Visualisation:

# Plot distribution of LEN_MC
cl.length.hist <- ggplot(cl.order, aes(x = LEN_MC)) +
                  geom_histogram(binwidth = 2)

cl.length.hist +
  # Add mean
  geom_vline(aes(xintercept = mean(LEN_MC)),
             color = "steelblue",
             linewidth = 1) +
  theme_classic()

# Plot distribution of LEN_MC
cl.length.dens <- ggplot(cl.order, aes(x = LEN_MC)) +
                  geom_density()

cl.length.dens +
  # Add mean
  geom_vline(aes(xintercept = mean(LEN_MC)),
             color = "steelblue",
             linewidth = 1) +
  theme_classic()

3.1.2 The median

  • The median() function computes “the halfway point of the data (50% of the data are above the median; 50% of the data are below)” (Winter 2020: 58):

    \[ \tilde{x}_{0.5} = \begin{cases} x_{((n+1)/2)} & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{(n/2)}+x_{(n/2+1)}\right) & \text{if } n \text{ is even.} \end{cases} \]

# Using median()
median(cl.order$LEN_MC)
[1] 8
# or by hand:
sample_sorted <- sort(cl.order$LEN_MC) # sort values in ascending order

n <- length(cl.order$LEN_MC) # sample size is 403 (odd number!)

median_by_hand <- sample_sorted[(n + 1) %/% 2] # compute median (odd-n case)
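The by-hand version above covers the odd-\(n\) case only; for an even sample size, the formula averages the two middle values of the sorted sample. A minimal sketch on a toy vector (not the cl.order data):

```r
# Even-n case: average the two middle values of the sorted sample
toy <- c(4, 1, 3, 2)                 # toy data, n = 4 (even)
toy_sorted <- sort(toy)              # 1 2 3 4
n_toy <- length(toy)
median_even <- (toy_sorted[n_toy / 2] + toy_sorted[n_toy / 2 + 1]) / 2
median_even  # 2.5, matching median(toy)
```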

Visualisation:

cl.length.hist +
  # Add mean
  geom_vline(aes(xintercept = mean(LEN_MC)), color = "steelblue", linewidth = 1) +
  # Add median
  geom_vline(aes(xintercept = median(LEN_MC)), color = "red", linewidth = 1) +
  theme_classic()

cl.length.dens +
  # Add mean
  geom_vline(aes(xintercept = mean(LEN_MC)), color = "steelblue", linewidth = 1) +
  # Add median
  geom_vline(aes(xintercept = median(LEN_MC)), color = "red", linewidth = 1) +
  theme_classic()

3.1.3 Sample variance and standard deviation

  • In order to assess how well the mean represents the data, it is instructive to compute the variance (var()) and the standard deviation (sd()) of a sample.

  • The sample variance is defined as

\[s^2 = \frac{1}{n}\sum_{i=1}^n{(x_i - \bar{x})^2}. \]

# Using var()
var(cl.order$LEN_MC)
[1] 25.12585
# or by hand:

sample_data <- cl.order$LEN_MC

# Calculate the sample variance (formula above)

var_by_hand <- 1 / length(sample_data) * sum((sample_data - mean(sample_data))^2)

# Note that R's var() function applies a bias correction, dividing by n - 1 rather than n:

var_corrected <- 1 / (length(sample_data) - 1) * sum((sample_data - mean(sample_data))^2) # equivalent to var()

\[ s = \sqrt{\frac{1}{n}\sum_{i=1}^n{(x_i - \bar{x})^2}} \]

# Using sd()
sd(cl.order$LEN_MC)
[1] 5.012569
# or by hand:

sample_data <- cl.order$LEN_MC

# Calculate the sample standard deviation (with R's n - 1 bias correction, equivalent to sd())

sd_by_hand <- sqrt(1 / (length(sample_data) - 1) * sum((sample_data - mean(sample_data))^2))

Application and visualisation:

cl.length.hist +
  # Add vertical line for the mean
  geom_vline(aes(xintercept = mean(LEN_MC)), color = "steelblue", linewidth = 1) +
  # Add -1sd
  geom_vline(aes(xintercept = mean(LEN_MC) - sd(LEN_MC)), color = "orange", linewidth = 1) +
  # Add +1sd
  geom_vline(aes(xintercept = mean(LEN_MC) + sd(LEN_MC)), color = "orange", linewidth = 1) +
  theme_classic()

# Create data frame with mean and sd for each clause ORDER

cl.order %>% 
  # Select variables of interest
  select(ORDER, LEN_MC) %>% 
  # Group results of following operations by ORDER
  group_by(ORDER) %>% 
    # Create grouped summary of mean and sd for each ORDER
    summarise(mean = mean(LEN_MC),
              sd = sd(LEN_MC)) -> cl_mean_sd; cl_mean_sd
# A tibble: 2 × 3
  ORDER  mean    sd
  <chr> <dbl> <dbl>
1 mc-sc  9.04  4.91
2 sc-mc  9.75  5.22
# Plot results 

ggplot(cl_mean_sd, aes(x = ORDER, y = mean)) +
  # Barplot with a specific variable mapped onto y-axis
  geom_col() +
  # Add mean and standard deviation to the plot
  geom_errorbar(aes(x = ORDER,
                    ymin = mean-sd,
                    ymax = mean+sd), width = .2) +
  theme_classic() +
  labs(y = "Mean length of main clauses", x = "Clause order")

3.1.4 Quantiles

  • While median() divides the data into two equal halves at the 50% quantile, the quantile() function makes it possible to partition the data further.

    quantile(cl.order$LEN_MC)
      0%  25%  50%  75% 100% 
       2    6    8   11   31 
  • quantile(x, 0) and quantile(x, 1) thus return the minimum and maximum values, respectively.

    quantile(cl.order$LEN_MC, 0)
    0% 
     2 
    quantile(cl.order$LEN_MC, 1)
    100% 
      31 
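Beyond the default quartiles, quantile() also accepts a probs argument for arbitrary cut points. A brief sketch on a toy vector (the exact interpolated values depend on R's default quantile algorithm):

```r
x <- c(2, 4, 6, 8, 10)  # toy data
# Deciles: partition the data into ten equal parts
quantile(x, probs = seq(0, 1, by = 0.1))
# A single custom quantile, e.g. the 50% quantile (the median)
quantile(x, probs = 0.5)  # 6
```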

3.1.5 Quartiles and boxplots

The structure of the boxplot (Wickham, Çetinkaya-Rundel, and Grolemund 2023: Chapter 2.3.1)
  • Consider the distribution of clause length by clause order:
ggplot(cl.order, aes(x = ORDER, y = LEN_MC)) +
  geom_boxplot() +
  theme_classic()

ggplot(cl.order, aes(x = ORDER, y = LEN_MC)) +
  geom_boxplot() +
  geom_jitter() + # add data points 
  theme_classic()

  • Compare it to the corresponding rotated density plot:
ggplot(cl.order, aes(x = LEN_MC, fill = ORDER)) +
  geom_density(alpha = 0.5) +
  coord_flip() +
  theme_classic()

3.2 Theoretical distributions

3.2.1 The normal distribution

A great number of numeric variables follow the well-known normal (or Gaussian) distribution, including test scores, weights and heights, among many others.

If a random variable \(X\) is normally distributed, it is determined by the parameters \(\mu\) (the mean) and \(\sigma\) (the standard deviation). Formally, we can summarise this using the notation

\[ X \sim N(\mu, \sigma^2).\] The probability density function (PDF) of the normal distribution has a characteristic bell shape. The density values on the \(y\)-axis indicate the relative likelihood of encountering a specific value of \(X\) (cf. Winter 2020: 56; Heumann, Schomaker, and Shalabh 2022: 173-177).
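R's built-in distribution functions make this concrete. As a sketch with toy parameters (a standard normal, not values estimated from the cl.order data), pnorm() returns cumulative probabilities and dnorm() returns density values:

```r
mu <- 0; sigma <- 1  # toy parameters for a standard normal
# Probability of X falling within one standard deviation of the mean
p_within_1sd <- pnorm(mu + sigma, mean = mu, sd = sigma) -
  pnorm(mu - sigma, mean = mu, sd = sigma)
round(p_within_1sd, 3)  # approximately 0.683
# dnorm() returns the density (the height of the bell curve) at a point
dnorm(mu, mean = mu, sd = sigma)  # the peak of the PDF
```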

3.2.2 Bernoulli distribution

The Bernoulli distribution is a discrete probability distribution for random variables which have only two possible outcomes: “positive” (often coded as 1) and “negative” (often coded as 0). Examples of such variables include coin tosses (heads/tails), binary response questions (yes/no), and defect status (defective/non-defective).

If a random variable \(X\) follows a Bernoulli distribution, it is determined by the parameter \(p\), which is the probability of the positive case:

\[ X \sim Bernoulli(p).\] The probability mass function (PMF) of the Bernoulli distribution is given by: \[ P(X = x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases} \]

where \(0 \leq p \leq 1\). This function shows the probability of \(X\) taking on the value of 1 or 0 (cf. Heumann, Schomaker, and Shalabh 2022: 162-163).
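As a sketch, the PMF can be written out directly and compared against simulated trials; in R, a Bernoulli draw is a binomial draw with size = 1 (the value of p below is an arbitrary toy choice):

```r
p <- 0.3  # toy probability of the positive case
bernoulli_pmf <- function(x) ifelse(x == 1, p, 1 - p)
bernoulli_pmf(1)  # 0.3
bernoulli_pmf(0)  # 0.7
# Simulate Bernoulli trials: binomial with size = 1
set.seed(123)
draws <- rbinom(10000, size = 1, prob = p)
mean(draws)  # close to p
```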

Extensions

A Bernoulli experiment presupposes a single trial (e.g., tossing a coin once). If we are interested in the distribution of a binary discrete variable over \(n\) Bernoulli trials, we can describe it in terms of the binomial distribution (Heumann, Schomaker, and Shalabh 2022: 163-166).

Categorical variables with more than two outcomes observed over \(n\) independent trials can be modelled using the multinomial distribution (Heumann, Schomaker, and Shalabh 2022: 167-169).
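Both extensions are available in base R. A minimal sketch with toy parameters:

```r
# Binomial: probability of exactly 3 heads in 10 fair coin tosses
dbinom(3, size = 10, prob = 0.5)  # choose(10, 3) / 2^10 = 0.1171875
# Multinomial: one draw of 10 trials over three possible outcomes
set.seed(42)
counts <- rmultinom(1, size = 10, prob = c(0.2, 0.3, 0.5))
counts  # a 3 x 1 matrix of counts summing to 10
```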